The pan-genome and local adaptation of Arabidopsis thaliana

Arabidopsis thaliana serves as a model species for investigating various aspects of plant biology. However, the contribution of genomic structural variations (SVs) and their associated genes to the local adaptation of this widely distributed species remains unclear. Here, we de novo assemble chromosome-level genomes of 32 A. thaliana ecotypes and determine that variable genes expand the gene pool in different ecotypes and thus assist local adaptation. We develop a graph-based pan-genome and identify 61,332 SVs that overlap with 18,883 genes, some of which are highly involved in the ecological adaptation of this species. For instance, we observe a specific 332 bp insertion in the promoter region of the HPCA1 gene in the Tibet-0 ecotype that enhances gene expression, thereby promoting adaptation to alpine environments. These findings augment our understanding of the molecular mechanisms underlying the local adaptation of A. thaliana across diverse habitats.

Supplementary Fig. 1. Estimation of Col-0 genome size by k-mer analysis. The figure shows the frequency distribution of 17-mers, which are 17 bp sequences from clean reads of short-insert-size libraries. We identified 6,471,685,116 k-mers, and the peak k-mer depth is 47. Genome size can be estimated as (total k-mer number) / (peak k-mer depth). The genome size of Col-0 was thus estimated as 137.70 Mb.
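The arithmetic behind this estimate can be reproduced with a minimal sketch. The function name `estimate_genome_size` is hypothetical and introduced only for illustration; the two input values are the ones stated in the caption, and the calculation is the simple ratio given there, without any correction for sequencing errors or heterozygosity.

```python
def estimate_genome_size(total_kmers: int, peak_depth: int) -> float:
    """Estimate genome size in bp as total k-mer count divided by peak k-mer depth.

    Hypothetical helper: it implements only the ratio stated in the caption.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers / peak_depth


# Values reported for Col-0 in Supplementary Fig. 1.
size_bp = estimate_genome_size(6_471_685_116, 47)
print(f"{size_bp / 1e6:.2f} Mb")  # prints "137.70 Mb"
```

With the reported inputs the ratio is 137,695,428 bp, which rounds to the 137.70 Mb given in the caption.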
Supplementary Fig. 4. Comparison of genes that differ from Araport11 between relict and non-relict ecotypes. Significance was tested by a two-tailed Wilcoxon test (p = 5.6e-5 < 0.05). The middle line of each boxplot is the median; the lower and upper hinges correspond to the first and third quartiles; the upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range), and the lower whisker extends from the hinge to the smallest value at most 1.5 × IQR from the hinge; outliers are removed. Source data are provided as a Source Data file.
The middle line of each boxplot is the median; the lower and upper hinges correspond to the first and third quartiles; the upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range), and the lower whisker extends from the hinge to the smallest value at most 1.5 × IQR from the hinge; outliers are removed. Significance was tested by a two-tailed Wilcoxon test (p = 0, 5.143244e-164 and 5.944438e-289).
Supplementary Fig. 15. The expression level of genes with insertions of different TE types. The middle line of each boxplot is the median; the lower and upper hinges correspond to the first and third quartiles; the upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range), and the lower whisker extends from the hinge to the smallest value at most 1.5 × IQR from the hinge; outliers are removed. Significance was tested by a two-tailed Wilcoxon test (p = 7.532357e-280 and 7.301108e-126).
Supplementary Fig. 19. The expression level of genes overlapped by SVs in different regions. The middle line of each boxplot is the median; the lower and upper hinges correspond to the first and third quartiles; the upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range), and the lower whisker extends from the hinge to the smallest value at most 1.5 × IQR from the hinge; outliers are removed. Significance was tested by a two-tailed Wilcoxon test (p = 2.933524e-76).
Supplementary Fig. 20. The expression level of genes overlapped by different SV types. The middle line of each boxplot is the median; the lower and upper hinges correspond to the first and third quartiles; the upper whisker extends from the hinge to the largest value no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range), and the lower whisker extends from the hinge to the smallest value at most 1.5 × IQR from the hinge; outliers are removed. Significance was tested by a two-tailed Wilcoxon test.
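The boxplot convention repeated in these captions (median line, quartile hinges, whiskers limited to 1.5 × IQR from the hinges, outliers excluded) can be illustrated with a short sketch. The function `boxplot_stats` and the toy values are hypothetical and not taken from the study's data; the quartile rule used here (linear interpolation, Python's "inclusive" method) is an assumption, since the captions do not state which quartile definition the plotting software applied.

```python
from statistics import median, quantiles


def boxplot_stats(values):
    """Return median, hinges, whisker ends and outliers for one boxplot.

    Hypothetical helper. Quartiles use linear interpolation ("inclusive"),
    which is an assumption rather than the authors' documented setting.
    """
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "median": median(data),
        "q1": q1,
        "q3": q3,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Toy expression values (illustrative only).
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

In this toy example the value 100 lies beyond the upper fence, so it is reported as an outlier and the upper whisker stops at 9.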